Background: Correcting a heterogeneous dataset that contains artefacts from several confounders is often an essential bioinformatics task. However, attempting to remove these batch effects may also remove some biologically meaningful signal. Thus, a central challenge is assessing whether the removal of unwanted technical variation harms the biological signal that is of interest to the researcher.
Results: We describe a novel framework, B-CeF, to evaluate the effectiveness of batch correction methods and their tendency toward over- or under-correction. The approach is based on comparing the co-expression of gene-gene pairs in the adjusted data to a priori knowledge of highly confident gene-gene associations, based on thousands of unrelated experiments and derived from an external reference. Our framework includes three steps: (1) data adjustment with the desired methods; (2) calculating gene-gene co-expression measurements for the adjusted datasets; and (3) evaluating the performance of the co-expression measurements against a gold standard. Using the framework, we evaluated five batch correction methods applied to RNA-seq data of six representative tissue datasets derived from the GTEx project.
Conclusions: Our framework enables the evaluation of batch correction methods in terms of how well they preserve the original biological signal. We show that using a multiple linear regression model to correct for known confounders outperforms factor analysis-based methods that estimate hidden confounders. The code is publicly available as an R package.
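The three steps of the framework can be illustrated with a minimal sketch on simulated data. This is not the interface of the R package: the function names (`regress_out`, `coexpression`, `auc`), the use of absolute Pearson correlation as the co-expression measure, the rank-based AUC as the performance score, and the toy gold standard matrix are all assumptions made for illustration only.

```python
import numpy as np


def regress_out(expr, covariates):
    """Step 1 (assumed form): remove known confounders by least squares.

    expr is samples x genes; covariates is samples or samples x k.
    """
    # Design matrix with an intercept column followed by the covariates.
    design = np.column_stack([np.ones(len(covariates)), covariates])
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    # Subtract only the covariate contribution; the per-gene intercept is kept.
    return expr - design[:, 1:] @ coef[1:]


def coexpression(expr):
    """Step 2 (assumed form): absolute Pearson correlation per gene pair."""
    corr = np.corrcoef(expr, rowvar=False)
    i, j = np.triu_indices(corr.shape[0], k=1)  # each unordered pair once
    return np.abs(corr[i, j]), i, j


def auc(scores, labels):
    """Step 3 (assumed form): rank-based AUC, ties receive average ranks."""
    labels = np.asarray(labels, dtype=bool)
    _, inverse, counts = np.unique(
        np.asarray(scores, dtype=float), return_inverse=True, return_counts=True
    )
    end = np.cumsum(counts)
    avg_rank = end - (counts - 1) / 2.0  # mean 1-based rank of each tie group
    ranks = avg_rank[inverse]
    n_pos = labels.sum()
    n_neg = labels.size - n_pos
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Simulated example: genes 0-2 share a biological module; a binary batch
# variable shifts every gene. All values here are synthetic.
rng = np.random.default_rng(0)
n_samples, n_genes = 200, 6
batch = rng.integers(0, 2, size=n_samples).astype(float)
module = rng.normal(size=n_samples)
expr = rng.normal(size=(n_samples, n_genes))
expr[:, :3] += 2.0 * module[:, None]
expr += 3.0 * batch[:, None]

# Hypothetical gold standard: True where a gene pair is a known association.
gold = np.zeros((n_genes, n_genes), dtype=bool)
gold[:3, :3] = True

adjusted = regress_out(expr, batch)
for name, data in [("raw", expr), ("adjusted", adjusted)]:
    scores, i, j = coexpression(data)
    print(name, round(auc(scores, gold[i, j]), 3))
```

In this sketch, a correction method that removes too much or too little would show up as a lower score against the gold standard, which is the comparison the framework is designed to make across methods.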